The dataset we choose was a subset of the "Theater History of Operations", which is a dataset that tracks every ariel bombing mission recorded from WW1 to Vietnam. This particular data set contains records of mainly US and Royal Air Force mission who did the majority of the ariel bombing throughout the war. It is in a CSV format and in a few seprate files that contain tables of information about the various data. It can be found here.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap
data = pd.read_csv("data/THOR_WWII_DATA_CLEAN.csv", sep=",", encoding="latin-1") # For some reason the CSV did not like to be loaded without specifiying this encoding
data
/home/froth/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (14,48,49,51,52,59,60) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
| WWII_ID | MASTER_INDEX_NUMBER | MSNDATE | THEATER | NAF | COUNTRY_FLYING_MISSION | TGT_COUNTRY_CODE | TGT_COUNTRY | TGT_LOCATION | TGT_TYPE | ... | CALLSIGN | ROUNDS_AMMO | SPARES_RETURN_AC | WX_FAIL_AC | MECH_FAIL_AC | MISC_FAIL_AC | TARGET_COMMENT | MISSION_COMMENTS | SOURCE | DATABASE_EDIT_COMMENTS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | SPADAFORA | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 4285 | 20028.0 | 2/20/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | PUERTA PRINCESA | UNIDENTIFIED TARGET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | COSENZA | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | GIOJA TAURO | NaN | ... | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 4 | 8167 | 14639.0 | 2/23/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | BALETE PASS | WOODED AREA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178276 | 135311 | NaN | 12/14/1942 | PTO | RAAF | AUSTRALIA | NaN | NEW GUINEA | BUNA AREA - 5 DESTROYERS / MOUTH OF KUMUSI RIVER | ENEMY WARSHIPS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178277 | 55692 | NaN | 9/18/1940 | MTO | SAAF | SOUTH AFRICA | 24.0 | ETHIOPIA | YAVELLO | AERODROME | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | DUST CLOUDS IN THE MIDDLE EAST | AUSTEN |
| 178278 | 133019 | NaN | 11/17/1942 | PTO | RAAF | AUSTRALIA | NaN | TIMOR | BAUCAU | BUILDINGS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178279 | 178741 | NaN | 2/16/1945 | PTO | RNZAF | NEW ZEALAND | NaN | NEW IRELAND | KAVIENG SWEEP | VARIOUS TARGETS | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 178280 | 122546 | NaN | 6/11/1942 | PTO | RAAF | AUSTRALIA | NaN | NaN | NaN | ENEMY MERCHANT SHIP | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
178281 rows × 62 columns
Lets drop some of those unecessary columns that we won't be using in order to make the data a bit more readable.
data.drop(['CALLSIGN','ROUNDS_AMMO','SPARES_RETURN_AC','WX_FAIL_AC','MECH_FAIL_AC','MISC_FAIL_AC','TARGET_COMMENT','MISSION_COMMENTS','SOURCE','DATABASE_EDIT_COMMENTS'], axis=1, inplace=True)
data.drop(['AC_LOST','AC_DAMAGED','AC_AIRBORNE','AC_DROPPING','TIME_OVER_TARGET','SIGHTING_METHOD_CODE','SIGHTING_EXPLANATION','BDA'], axis=1, inplace=True)
data
| WWII_ID | MASTER_INDEX_NUMBER | MSNDATE | THEATER | NAF | COUNTRY_FLYING_MISSION | TGT_COUNTRY_CODE | TGT_COUNTRY | TGT_LOCATION | TGT_TYPE | ... | NUMBER_OF_FRAG | TYPE_OF_FRAG | LBS_FRAG | TONS_OF_FRAG | TOTAL_LBS | TOTAL_TONS | TAKEOFF_BASE | TAKEOFF_COUNTRY | TAKEOFF_LATITUDE | TAKEOFF_LONGITUDE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | SPADAFORA | NaN | ... | NaN | NaN | NaN | NaN | NaN | 10.00 | PONTE OLIVO AIRFIELD | SICILY | 37.131022 | 14.321464 |
| 1 | 4285 | 20028.0 | 2/20/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | PUERTA PRINCESA | UNIDENTIFIED TARGET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | COSENZA | NaN | ... | NaN | NaN | NaN | NaN | NaN | 9.00 | PONTE OLIVO AIRFIELD | SICILY | 37.131022 | 14.321464 |
| 3 | 4 | NaN | 8/15/1943 | MTO | 12 AF | USA | 13.0 | ITALY | GIOJA TAURO | NaN | ... | NaN | NaN | NaN | NaN | NaN | 7.50 | PONTE OLIVO AIRFIELD | SICILY | 37.131022 | 14.321464 |
| 4 | 8167 | 14639.0 | 2/23/1945 | PTO | 5 AF | USA | NaN | PHILIPPINE ISLANDS | BALETE PASS | WOODED AREA | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178276 | 135311 | NaN | 12/14/1942 | PTO | RAAF | AUSTRALIA | NaN | NEW GUINEA | BUNA AREA - 5 DESTROYERS / MOUTH OF KUMUSI RIVER | ENEMY WARSHIPS | ... | NaN | NaN | NaN | 0.0 | 4500.0 | 2.25 | NaN | NaN | NaN | NaN |
| 178277 | 55692 | NaN | 9/18/1940 | MTO | SAAF | SOUTH AFRICA | 24.0 | ETHIOPIA | YAVELLO | AERODROME | ... | NaN | NaN | NaN | NaN | 0.0 | 0.00 | NAIROBI | KENYA | -1.283300 | 36.816700 |
| 178278 | 133019 | NaN | 11/17/1942 | PTO | RAAF | AUSTRALIA | NaN | TIMOR | BAUCAU | BUILDINGS | ... | NaN | NaN | NaN | 0.0 | 4800.0 | 2.40 | BATCHELOR FIELD | AUSTRALIA | NaN | NaN |
| 178279 | 178741 | NaN | 2/16/1945 | PTO | RNZAF | NEW ZEALAND | NaN | NEW IRELAND | KAVIENG SWEEP | VARIOUS TARGETS | ... | NaN | NaN | NaN | NaN | 8000.0 | 4.00 | NaN | NaN | NaN | NaN |
| 178280 | 122546 | NaN | 6/11/1942 | PTO | RAAF | AUSTRALIA | NaN | NaN | NaN | ENEMY MERCHANT SHIP | ... | NaN | NaN | NaN | 0.0 | 1000.0 | 0.50 | DARWIN | AUSTRALIA | NaN | NaN |
178281 rows × 44 columns
Now that we have loaded the data lets do some basic analysis! Lets see how many tons of bombs the allies dropped over the corse of the war. We can do this easily by specifying the column containing the total tons of weaponry dropped and calling sum on that column.
total_tons = data["TOTAL_TONS"].sum()
total_tons
4268503.783999999
That is a pretty isane amount of weaponry dropped over the corse of WW2. Now lets try to recreate that graph on the website we got the data from that showed which countries had the most tons of weaponry dropped on them.
# We want to group the data by country being bombed, sum the total tons dropped, and sort by decreasing order. Then we to reset
# the index so we can query by TGT_COUNTRY and then grab the first 5 elements to graph
per_country = (data.groupby(["TGT_COUNTRY"]).sum()).sort_values(by="TOTAL_TONS", ascending=False).reset_index().head(5)
plt.figure(figsize=(20,10)) # Set image size
bar_graph = sns.barplot(x="TGT_COUNTRY",y="TOTAL_TONS", data=per_country)
# This line makes it so we don't use scientific notation for the y axis
bar_graph.ticklabel_format(style="plain", axis="y")
# Set labels
bar_graph.set_xlabel("Target Country")
bar_graph.set_ylabel("Tons of bombs")
bar_graph.set_title("Tons of bombs dropped on each country")
Text(0.5, 1.0, 'Tons of bombs dropped on each country')
Wow! That is a lot of bombs dropped on Germany. Now lets now see who was dropping those bombs. We can do a similar strategey as the last graph except this time we will group by the country flying the mission instead of the target country.
most_dropped = data.groupby(["COUNTRY_FLYING_MISSION"]).sum().sort_values(by=["TOTAL_TONS"],ascending=False).reset_index()
plt.figure(figsize=(20,10))
bar_graph = sns.barplot(x="COUNTRY_FLYING_MISSION",y="TOTAL_TONS", data=most_dropped)
bar_graph.ticklabel_format(style="plain", axis="y")
bar_graph.set_xlabel("Country dropping")
bar_graph.set_ylabel("Tons of bombs")
bar_graph.set_title("Tons of bombs dropped by each country")
Text(0.5, 1.0, 'Tons of bombs dropped by each country')
The USA and Great Britan unsuprisingly did the most dropping of bombs. Now lets find which planes did most of this bomb dropping.
# Lets group the data by aircraft name and see which aircraft has dropped the most tons of bombs
aircrafts = data.groupby(["AIRCRAFT_NAME"]).sum().sort_values(by=["TOTAL_TONS"],ascending=False).reset_index().head(9) # Lets get the top 9 and leave room for an other category
# Create percentage col to use for piechart
aircrafts["PERCENTAGE_DROPPED"] = (aircrafts["TOTAL_TONS"]/total_tons)*100
# Calculate remaining
other_percentage = 100 - aircrafts["PERCENTAGE_DROPPED"].sum()
# Append other to the dataframe
aircrafts = aircrafts.append({"PERCENTAGE_DROPPED" : other_percentage,"AIRCRAFT_NAME" : "Other" }, ignore_index=True)
# plot
plt.figure(figsize=(20,10))
colors = sns.color_palette()[0:10]
plt.pie(aircrafts["PERCENTAGE_DROPPED"], colors=colors, labels=aircrafts["AIRCRAFT_NAME"], autopct="%.1f%%")
plt.title("Percentage of all bombs dropped by aircraft")
plt.show()
The B17 leads this category having dropped almost 30% of all bombs dropped by the US and Great Britan over the war. This is impressive but not unexpected as the B17 was one of the most mass produced and effictive bombers of the war. Britanica states that the B17 was "was the mainstay of the strategic bombing campaign" for the US.
Now lets try to get a nice overview of the amount of bombing that occured over time.
data["DATE_TIME"] = pd.to_datetime(data["MSNDATE"], format="%m/%d/%Y") # Convert to datetimes for ease of use
by_time = data.groupby(["DATE_TIME"]).sum().reset_index()
plt.figure(figsize=(20,10))
line_graph = sns.lineplot(x="DATE_TIME",y="TOTAL_TONS", data=by_time)
line_graph.set_xlabel("Time")
line_graph.set_ylabel("Tons of bombs")
line_graph.set_title("Tons of bombs over time")
Text(0.5, 1.0, 'Tons of bombs over time')
Now lets bring in folium to do some visualizing of what areas were bombed the most. Folium will let us create an interactive heatmap in order to see what areas were most bombed by the allies throughout the war. For viewer convience we will lable the capital cities of the main Axis powers.
# Capital Axis cities LATITUDE, LONGITUDE
berlin = 52.520008,13.404954
tokyo = 35.652832, 139.839478
rome = 41.902782, 12.496366
map_osm = folium.Map(location=berlin, zoom_start=5) #Create a map centered on Berlin
valid_lat = data.groupby(data["LATITUDE"].isnull()).get_group(False) # Make sure latitude is not NaN
valid_lat_long = valid_lat.groupby(data["LONGITUDE"].isnull()).get_group(False) # Make sure longitutde is not NaN
heat_map_data = [[row["LATITUDE"],row["LONGITUDE"],] for index, row in valid_lat_long.iterrows()] # Iterate through valid points and add to a list
HeatMap(heat_map_data).add_to(map_osm) # Add the list of points to the map as a heatmap
# Label the capital Axis cities
folium.Marker(location=berlin,popup="<b>Berlin</b>").add_to(map_osm)
folium.Marker(location=tokyo,popup="<b>Tokyo</b>").add_to(map_osm)
folium.Marker(location=rome,popup="<b>Rome</b>").add_to(map_osm)
# Show map
map_osm